Analyzing Cellular Populations using CATCH - a tutorial:

0. Introduction

Cells occupy a hierarchy of transcriptional identities, which is difficult to study in an unbiased manner when perturbed by disease. To identify, characterize, and compare clusters of cells, we present CATCH, a coarse-graining framework that learns the cellular hierarchy by applying a deep cascade of manifold-intrinsic diffusion filters. CATCH includes a suite of tools, based on the connection we forge between topological data analysis and data diffusion geometry, to identify salient levels of the hierarchy, automatically characterize clusters, and rapidly compute differentially expressed genes between clusters of interest. When used in conjunction with MELD (https://github.com/KrishnaswamyLab/MELD), CATCH has been shown to identify rare populations of pathogenic cells and create robust disease signatures.

1. Installation and Setup:

If you haven't already, install the CATCH package using pip (requirements are listed at https://github.com/KrishnaswamyLab/CATCH). Then import CATCH along with the other packages used in this tutorial before proceeding with the analysis.

2. Loading and Pre-processing Data

In this section we will download 10X human Peripheral Blood Mononuclear Cell (PBMC) data to your local computer and pre-process it for analysis with CATCH.

Now that we have loaded the data, we will remove cells with low transcript counts (fewer than 1000 counts per cell) and rarely expressed genes (expressed in fewer than 5 cells):
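As a rough illustration of what this filtering step does, here is a minimal NumPy sketch. It assumes a dense cells-by-genes count matrix; the function name `filter_cells_and_genes` and its parameters are hypothetical and are not part of CATCH or any other library.

```python
import numpy as np

def filter_cells_and_genes(counts, min_counts=1000, min_cells=5):
    """Drop low-count cells, then genes detected in too few remaining cells.

    counts: array of shape (n_cells, n_genes) holding raw transcript counts.
    Hypothetical helper for illustration only.
    """
    counts = np.asarray(counts)
    # Keep cells (rows) whose total transcript count is at least min_counts.
    keep_cells = counts.sum(axis=1) >= min_counts
    counts = counts[keep_cells]
    # Keep genes (columns) detected (count > 0) in at least min_cells cells.
    keep_genes = (counts > 0).sum(axis=0) >= min_cells
    return counts[:, keep_genes], keep_cells, keep_genes
```

The two boolean masks are returned so that cell barcodes and gene names can be subset in the same way as the matrix.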

Finally, we will library-size normalize and square-root transform the expression data, as is standard in single-cell analysis:
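The two transformations can be sketched in plain NumPy as follows. This is an illustrative sketch only: the function names and the rescaling constant of 10,000 are assumptions, not values taken from this tutorial.

```python
import numpy as np

def library_size_normalize(counts, rescale=10_000):
    """Scale every cell (row) so that its counts sum to `rescale`.

    The rescale value is an assumption made for this sketch.
    """
    counts = np.asarray(counts, dtype=float)
    libsize = counts.sum(axis=1, keepdims=True)
    libsize[libsize == 0] = 1.0  # avoid division by zero for empty cells
    return counts / libsize * rescale

def sqrt_transform(expression):
    """Square-root transform, which compresses the range of high counts."""
    return np.sqrt(expression)

# Two cells with the same composition but different sequencing depth
# end up with identical normalized profiles.
toy_counts = np.array([[1, 3], [10, 30]])
normalized = library_size_normalize(toy_counts)
transformed = sqrt_transform(normalized)
```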

Next, we'll run PHATE (https://www.nature.com/articles/s41587-019-0336-3), a dimensionality reduction algorithm previously developed by the Krishnaswamy lab, to produce a two-dimensional visualization of the data.

3. Running CATCH on our dataset

First, we run the CATCH condensation process on the data. The full workflow in this section is as follows:

  1. Compute the CATCH operator on our dataset
  2. Identify clustering granularities amenable for analysis via topological activity analysis
  3. Visualize condensation homology
  4. Visualize our datasets with the chosen levels of resolution
  5. Compute differentially expressed genes between clusters using condensed transport
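To give some intuition for the first step in the list above, the sketch below implements a heavily simplified diffusion condensation loop in NumPy: build a Gaussian kernel over the current points, row-normalize it into a diffusion operator, apply it to pull points toward their neighbors, and merge points that come within a small tolerance of each other. This is not the CATCH implementation; the fixed bandwidth `sigma`, the greedy merge rule, and all function names are assumptions made for illustration.

```python
import numpy as np

def _merge_close(points, tol):
    """Greedily merge points lying within `tol` of an unmerged seed point."""
    n = len(points)
    group = -np.ones(n, dtype=int)
    centers = []
    for i in range(n):
        if group[i] >= 0:
            continue
        dist = np.linalg.norm(points - points[i], axis=1)
        close = (dist < tol) & (group < 0)
        group[close] = len(centers)
        centers.append(points[close].mean(axis=0))
    return np.array(centers), group

def condense(data, sigma=1.0, merge_tol=1e-3, n_iter=30):
    """Return one cluster-label array per condensation level (simplified)."""
    points = np.asarray(data, dtype=float)
    assign = np.arange(len(points))  # each cell starts as its own cluster
    levels = [assign.copy()]
    for _ in range(n_iter):
        # Gaussian affinities between the current cluster centers.
        d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
        kernel = np.exp(-d2 / (2 * sigma ** 2))
        # Row-normalize into a Markov diffusion operator and apply it.
        diffusion_op = kernel / kernel.sum(axis=1, keepdims=True)
        points = diffusion_op @ points
        # Merge points that have collapsed onto each other.
        points, group = _merge_close(points, merge_tol)
        assign = group[assign]
        levels.append(assign.copy())
        if len(points) == 1:
            break
    return levels

# Toy data: two well-separated blobs of 20 cells each.
rng = np.random.default_rng(0)
blob_a = rng.normal(0.0, 0.1, size=(20, 2))
blob_b = rng.normal(10.0, 0.1, size=(20, 2))
levels = condense(np.vstack([blob_a, blob_b]))
```

Each entry of `levels` is one granularity of the hierarchy, from every cell in its own cluster down to the coarsest grouping reached within `n_iter` iterations.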

Next, we identify granularities for downstream analysis using topological activity analysis:

We can now visualize the topological activity and identify granularities that partition single cells into meaningful, stable clusters:
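One simple way to think about this step is sketched below, under the assumption that topological activity at an iteration is the number of cluster merges that occur there, and that a stable granularity is one followed by several iterations with no merges. CATCH's actual criterion may differ; the function names and the `min_run` threshold are hypothetical.

```python
import numpy as np

def topological_activity(n_clusters):
    """Number of merges between consecutive condensation levels."""
    n = np.asarray(n_clusters)
    return n[:-1] - n[1:]

def stable_levels(n_clusters, min_run=3):
    """Return the first level of every plateau lasting at least `min_run` steps."""
    activity = topological_activity(n_clusters)
    found = []
    run_start = None
    for t, merges in enumerate(activity):
        if merges == 0:
            if run_start is None:
                run_start = t
        else:
            if run_start is not None and t - run_start >= min_run:
                found.append(run_start)
            run_start = None
    # Handle a plateau that extends to the final level.
    if run_start is not None and len(activity) - run_start >= min_run:
        found.append(run_start)
    return found

# Cluster counts per level for a toy hierarchy: plateaus at 6 and 2 clusters.
toy_counts = [10, 6, 6, 6, 6, 3, 2, 2, 2, 2, 1]
toy_stable = stable_levels(toy_counts)
```

Here `toy_stable` holds level indices, so `toy_counts[level]` gives the number of clusters at each stable granularity.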

Finally, we can visualize the CATCH-computed clusters on our initial PHATE embedding across stable granularities:

Now we can compute the condensation homology and visualize some of these stable clusters on the tree to help us identify optimal granularity levels for downstream analysis:

We often find that rotating the condensation homology plot allows for the best visualization of cluster separation:

Coarse grained analysis of cellular subsets

Since granularity level 98 seems like a reasonable separation of the coarse clusters present within the data, we will continue differential expression analysis with this set of clusters. Feel free to experiment with other granularity levels, however!

Now, in order to identify cluster-specific genes, we can perform differential expression analysis between cellular populations of interest through condensed transport analysis. The condensed transport command is a little involved: cluster{1,2} denote the cluster labels assigned by CATCH at granularities level{1,2}, respectively. For instance, in the following command, we compare cluster 0 to cluster 3, both of which are found at granularity 98. Feel free to experiment with cluster labels and granularities!

Now we can visualize the most differentially expressed genes between clusters 0 and 3:

We can compare our condensed transport values with ground truth gene transport values below:
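As a point of reference for this comparison, the sketch below computes a per-gene one-dimensional earth mover's distance (Wasserstein-1) between the cells of two clusters, using only NumPy. It assumes that "ground truth gene transport" refers to this kind of single-cell, per-gene distance; that reading, the function names, and the toy data are assumptions rather than CATCH's API.

```python
import numpy as np

def emd_1d(u, v):
    """Wasserstein-1 distance between two 1D samples via their empirical CDFs."""
    u = np.sort(np.asarray(u, dtype=float))
    v = np.sort(np.asarray(v, dtype=float))
    all_vals = np.sort(np.concatenate([u, v]))
    deltas = np.diff(all_vals)
    # Empirical CDF of each sample evaluated just after every breakpoint.
    u_cdf = np.searchsorted(u, all_vals[:-1], side="right") / len(u)
    v_cdf = np.searchsorted(v, all_vals[:-1], side="right") / len(v)
    return float(np.sum(np.abs(u_cdf - v_cdf) * deltas))

def per_gene_emd(expression, labels, cluster_a, cluster_b):
    """EMD for every gene (column) between the cells of two clusters."""
    expression = np.asarray(expression, dtype=float)
    labels = np.asarray(labels)
    cells_a = expression[labels == cluster_a]
    cells_b = expression[labels == cluster_b]
    return np.array([
        emd_1d(cells_a[:, g], cells_b[:, g])
        for g in range(expression.shape[1])
    ])

# Toy example: gene 0 is identical in both clusters, gene 1 is shifted by 2.
toy_expr = np.array([[1, 0], [2, 1], [3, 2], [1, 2], [2, 3], [3, 4]])
toy_labels = np.array([0, 0, 0, 3, 3, 3])
toy_emd = per_gene_emd(toy_expr, toy_labels, 0, 3)
ranked_genes = np.argsort(toy_emd)[::-1]  # most differentially expressed first
```

Ranking genes by this distance gives a baseline against which an approximate transport score can be checked.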

Fine grained analysis of cellular subsets

Next, we can look to a finer granularity of the condensation homology to identify rarer populations of cells and create more refined cell-type-specific signatures:

Now we can visualize the most differentially expressed genes between clusters 0 and 3:

Thank you for reviewing our tutorial. Please feel free to change parameters, test out different functionality, and report errors on the GitHub page!